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Abstract —In this letter, we address sparse signal recovery 
using spike and slab priors. In particular, we focus on a 
Bayesian framework where sparsity is enforced on reconstruction 
coefficients via probabilistic priors. The optimization resulting 
from spike and slab prior maximization is known to be a hard 
non-convex problem, and existing solutions involve simplifying 
assumptions and/or relaxations. We propose an approach called 
Iterative Convex Refinement (ICR) that aims to solve the afore¬ 
mentioned optimization problem directly allowing for greater 
generality in the sparse structure. Essentially, ICR solves a 
sequence of convex optimization problems such that sequence of 
solutions converges to a sub-optimal solution of the original hard 
optimization problem. We propose two versions of our algorithm: 
a.) an unconstrained version, and b.) with a non-negativity 
constraint on sparse coefficients, which may be required in some 
real-world problems. Experimental validation is performed on 
both synthetic data and for a real-world image recovery problem, 
which illustrates merits of ICR over state of the art alternatives. 


Index Terms —Compressive sensing, Bayesian inference, sparse 
signal, optimization, spike and slab prior, image reconstruction 


I. Introduction 

S Parse signal approximation and compressive sensing (CS) 
have recently gained considerable interest both in signal 
and image processing as well as statistics. Sparsity is often 
a natural assumption in inverse problems and sparse recon¬ 
struction or representation has variety of applications in im¬ 
age/signal classification JD-©> dictionary learning 0-03- 
signal recovery 03 - 1141. image denoising and inpainting 
, super resolution ITiSj and MRI image reconstruction 
. Typically, sparse models assume that a signal can be 
efficiently represented as sparse linear combination of atoms 
in a given or learned dictionary 0- GU- In other words, from 
CS viewpoint, a sparse signal can be recovered from fewer 
number of observations (T9j-(2TJ. 

A typical sparse reconstruction algorithm aims to recover a 
sparse signal x £ R p from a set of fewer measurements ygR ? 
(q <C p) according to the following model: 

y=Ax+n, (1) 

where A £ W //p is the measurement matrix (Dictionary) and 
n £ models the additive Gaussian noise with variance a 2 . 
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In recent years, many sparse recovery algorithms have 
been proposed including but not limited to the following: 
proposing sparsity promoting optimization problems involving 
different regularizers such as l\ norm, Iq pseudo norm, greedy 
algorithms 0-01 -||24), Bayesian-based methods 0 - 05 )- 
p6] or general sparse approximation algorithms such as 
SpaRSA, ADMM, etc. (13), j27)-(29). 

In this letter, we focus on sparse recovery from a Bayesian 
perspective by using hierarchical priors. In Bayesian sparse 
recovery, the choice of priors plays a key role in promoting 
sparsity and improving performance. Examples of such priors 
are Laplacian 130), generalized Pareto (31), Spike and Slab 


1321, etc. Amongst these priors, a well-suited sparsity promot¬ 
ing prior is spike and slab prior which is widely used in sparse 
recovery and Bayesian inference for variable selection and 
regression |17|, p0| , [33], |34|. In fact, it is acknowledged that 
spike and slab prior is indeed the gold standard for inducing 
sparsity in Bayesian inference [[35]. 

Using these priors for sparse recovery leads to non-convex, 
non-smooth, mixed integer programming optimization prob¬ 
lems which are often solved by means of relaxation or 
simplifying assumptions on the model parameters 0 . 07 ), 
|36]. However, in this work we aim to solve the spike and slab 
optimization problem directly in its general form. Motivated 
by this, the Main Contributions of our work are as follows: 
(1) We propose a novel Iterative Convex Refinement (ICR) 
method to solve the optimization problem resulting from 
exploiting spike and slab priors. Essentially, the sequence 
of solutions from these convex problems approaches a sub- 
optimal solution of the hard non-convex problem. (2) We 
propose two versions of ICR: a.) an unconstrained version, 
and b.) with a non-negativity constraint on sparse coefficients, 
which may be required in some real-world problems such as 
image recovery. (3) Finally, we perform experimental vali¬ 
dation on both synthetic data and a realistic image recovery 
problem, which reveals the benefits of ICR over other state-of- 
the-art recovery methods using spike and slab priors. Further, 
we compare the solution of various sparse recovery methods 
against the global solution for a small-scale problem, and 
remarkably the proposed ICR finds the most agreement with 
the global solution. Finally, convergence analysis is provided 
in support of the proposed ICR algorithm. 
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II. Spike and Slab Sparse Signal Recovery 

Introducing priors for capturing sparsity is a particular 
example of Bayesian inference where the signal recovery can 
be enhanced by exploiting contextual and prior information. 
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As suggested by (37), (38), sparsity can be induced via solving 
the following optimization problem: 

maxP^x) subject to ||y — Ax lb < e- ( 2 ) 

X 

where P x is the probability distribution function of x that 
captures sparsity. The most common example is the i.i.d. 
Laplacian prior which is equivalent to i\ norm minimization 
(3T). In this work, we focus on the spike and slab prior for 
inducing sparsity on x. Using this prior, every coefficient x, is 
modeled as a mixture of two densities as follows: 

Xi ~ (1 - y )5 0 + y iPi{xi) (3) 

where So is the Dirac function at zero (spike) and P, (slab) 
is an appropriate prior distribution for nonzero values of x, 
(e.g. Gaussian), y £ [0,1] controls the structural sparsity of the 
signal. If y- is chosen to be close to zero x, tends to remain 
zero. On the contrary, by choosing y close to 1, P, will be the 
dominant distribution encouraging x,- to take a non-zero value. 
Optimization Problem (Hierarchical Bayesian Framework): 
Inspired by Bayesian compressive sensing (CS) (2T), (33), we 
employ a hierarchical Bayesian framework for signal recovery. 
In this model, priors are employed on y and x. We also define 
Y,■ to be the indicator variable for the coefficient x,-, i.e. Y, — 
I(x,- f 0). It takes the value one only if the corresponding 
coefficient x, is not zero, and zero otherwise. More precisely, 
the Bayesian formulation is as follows: 

y\A,x, y,o 2 ~ 9f(Ax,o 2 l) (4) 

x|y,A.,o 2 ~ + (5) 

i= 1 
P 

yik ~ n Bernoulli(K,) ( 6 ) 

i=i 

where 5\C.) represents the Gaussian distribution. Also note 
that in (01 each coefficient of x is modeled as i.i.d spike and 
slab prior. In addition, a Bernoulli distribution is used to model 
the indicator variable Y/ with parameter K,-, which controls the 
sparseness of the signal. Motivated by a recent maximum a 
posteriori (MAP) estimation technique proposed in (36) the 
optimal x,Y are obtained by the following MAP estimate. 

(x*,f) = argmax{/(x,Y|A,y,K,A.,o 2 )}. (7) 

x.y 

Proposition 1. The MAP estimation above is equivalent to the 
following minimization problem: 

(x*,f) = argmin ||y-Ar||| + A,||x|||-|-^PiYi (8) 

x -y H 

where p, = a 2 log 

Proof. See supplementary material. 0 □ 

Remark: The optimization problem in (S) is a non-convex 
mixed integer programming involving the binary indicator 
variable y and is not easily solvable using conventional op¬ 
timization algorithms. It is worth mentioning that this is a 

'Also available at http://signal.ee.psu.edu/ICR/ICRpage.htm 


more general formulation than the framework proposed in |3) 
or (36) where authors simplified the optimization problem by 
assuming the same K for each coefficient x,. This assumption 
changes the last term in ( 8 ) to p||x||o and the resulting opti¬ 
mization is solved in (36) by using Majorization-Minimization 
Methods. Further, a relaxation of Iq to t\ norm reduces the 
problem to the well-known Elastic-Net (39) . The framework 
in therefore offers greater generality in capturing the 
sparsity of x. As an example, consider the scenario in a 
reconstruction or classification problem where some dictionary 
(training) columns are more important than others (40). It 
is then possible to encourage their contribution to the linear 
model by assigning higher values to the corresponding K,’s, 
which in turn makes it more likely that the i th coefficient x, 
becomes activated. 

III. Iterative Convex Refinement (ICR) 

We first develop a solution to (S} for the case when the 
entries of x are non-negative. Then, we propose our method 
in its general form with no constraints. 

The central idea of the proposed Iterative Convex Refine¬ 
ment (ICR) algorithm - see Algorithm 1 - is to generate a 
sequence of optimization problems that refines the solution 
of previous iteration based on solving a modified convex 
problem. At iteration n of ICR, the indicator variable y 
is replaced with the normalized ratio and the convex 

fj. ' 

optimization problem in is solved which is a simple 
quadratic programming with non-negativity constraint. Note 
that, pf 1 ’ is intuitively the average value of optimal x*’s 
obtained from iteration 1 up to n — 1 and is rigorously defined 
as in (IT) . The motivation for this substitution is that, if the 
sequence of solutions X'" ! converges to a point in W we also 
expect Ju to converge to y. Essentially, ICR is solving a 
A 

sequence of convex quadratic programming problem that their 
solution converges to a sub-optimal solution of ( 8 ). 

To generalize ICR to the unconstrained case, a simple mod¬ 
ification is needed at each iteration. In fact, at each iteration 
m is solved instead of (9). Note that (TO) is still convex and 
we solve it by alternating direction method of multipliers (29) . 

Again we expect the ratio foL to converge to the value of 

IA I 

optimal y and the result of ICR be a sub-optimal solution for 
( 01 . ICR in both its versions is summarized in Algorithm 10 

To analyze the convergence properties of ICR, we first 
define the function /„ : R p —> R as follows: 

f n {x) =x T (A T A + 'kI)x-2y T Ax+Y j —|x,-| (12) 

< =1 | Pi | 

which is another form of the functions to be minimized at 
each iteration of ICR. With this definition and assuming a 
is a constant that a < 2 (q+p) ’ we P ro P ose the following two 
lemmas with proofs in the supplementary material 1 : 

Lemma 1. If \^ ( " o) | < ap^, then x^ ,0+1) = 0. (y/ « ^ = 0) 

A 

2 The Matlab code for ICR is made available online at http://signal.ee.psu. 
edu/ICR/ICRpage. htm 
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Algorithm 1 Iterative Convex Refinement (ICR) 

Input: A,K,y. 

initialize: p' ]) =A T y, iteration index n 1. 
while Stopping criterion not met do 

(1) Solve the convex optimization problem at iteration n: 

(Non-negative) For non-negative ICR solve 

* ( ") = argndn||y—Axil?+ A,| |j:| I 2 + EP'GIFTT (9) 

i=i p) 


(Unconstrained) For unconstrained ICR solve 


X (n) = argmin | \y ~Ax\\\ + X\\x\ \j + 

y q W 

In l)i 

(10) 


< =1 | Pi \ 


(2) Update ] 

i= 1 ,-,P 

(11) 

n k= 1 



(3) Increase iteration index n. 



end while if *(") — jc 1 "" 1 * < tol 



Output: ** = y* = -Ajy for all 

i = 1 ,-,P- 



1 M 1 

This lemma also implies that if K I < «p, for some n, 
then Xj will remain zero for all the following iterations. 

Lemma 2. If // / | > rap, for all n > no, then there exists 
Nj > no such that for all n > Nj we have 


,,(«+!) I 


1 ^ C 

,(«)I ~ n+ 1 


(13) 


satisfies the following property (assuming W < 1): 


< 


\fn+l{x (n] ) ~ fn{x {n) )\ = £p,(^y 

E Pi 


1=1 Wi 


1 ) 


■V 


K-" ‘Wp; 


1 

1 

1 ("11 

1 —1) 1 

P, \ 

I Pi \ 


+ E P< 


I (n) I I (n-1) 

| Hi | | Hi 


K-" ‘’l^ap,- 


< E Pi~\ x i\ < pmax{p,}- < 

* * n n 


c c 

< 

n n 


(15) 


This property also holds for x-" , Finally, We show that for 
n > No, a„ = ffx * 1 '"' 1 ) is Quasi-Cauchy. Since the minimum 
value /„ + i(jc(" +1 1 ) is smaller than f n+ \ (x ln ' > ), we can write: 


/„ +1 (*(" +1 )) -/„(*«) < fn + M n) ) ~/„(xW) < C ~, 

where we used for n > No- With the same reasoning for 
n > No we have: 


n 

Combining these two inequalities results (fl4) for n > No- □ 


Combination of this theorem with a reasonable stopping 
criterion guarantees the termination of the ICR algorithm. The 
stopping criteria used in this case is the norm of difference 
in the solutions in consecutive iterations. At termination 

where the solution converges, the ratio f n] will be zero for 

n) 

zero coefficients and approaches 1 for nonzero coefficients, 
which matches the value of y, in both cases. 


where c is some positive constant. 


IN. Experimental Validation 


Another interpretation of this lemma is that as the number 
of iterations grows, the cost functions at each iteration of ICR 
get closer to each other. In view of these two lemmas, we 
can show that the sequence of optimal cost function values 
obtained from ICR algorithm forms a Quasi-Cauchy sequence 
ED- In other words, this is a sequence of bounded values that 
their difference at two consecutive iterations gets smaller. 

Theorem 1. After a sufficiently large n, the sequence of 
optimal cost function values obtained from ICR forms a 
Quasi-Cauchy sequence, i.e. a n = f n (x^) is a Quasi-Cauchy 
sequence of numbers. 

\f n+l (x^)-f n (x^)\<-. (14) 

n 

Proof. We provide a sketch of the proof here, for more details 
please refer to the supplementary material 1 . 

Before proving the theorem, note that we can assume for a 
sufficiently large No, if n > No, then is either always 

less that ap 7 or always greater (details can be found in 
the supplementary material). We now proceed to prove the 
Theorem and show that for n > No, the sequence of f n {x"' ! ) 


We now apply the ICR method to sparse signal recovery 
problem using spike and slab priors. Two experimental sce¬ 
narios are considered: 1.) synthetic data and 2.) a real-world 
image recovery problem. In each case, comparisons are made 
against state of the art alternatives. 

Synthetic data: We set up a typical experiment for sparse 
recovery as in |24) , |36) with a randomly generated Gaussian 
matrix A £ and a sparse vector xo £ R p . Based on A 

and xo, we form the observation vector y £ W 1 according to 
the additive noise model: y=Axo+n with o = 0.01. The 
competitive state-of-the-art methods for spike and slab sparse 
recovery that we compare against are: (1) SpaRSA jig, @ 
which is a powerful method to solve the problems of the 
form <0- (2) Majorization Minimization (MM) algorithm [36j 
which aims to solve the spike and slab signal recovery problem 
through a majorization minimization approach. (3) Adaptive 
Elastic Net (39), (43| | based on a f \ relaxation of cost function. 
Initialization for all methods is consistent as suggested in }42) . 

Table [I] reports the experimental results for a small scale 
problem. We chose to first report results on a small scale prob¬ 
lem in order to be able to use the IBM ILOG CPLEX optimizer 
144] which is a very powerful optimization toolbox for solving 
many different optimization problems. It can also find the 
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Table I 

Comparison of methods for p = 64 and q = 32. On Average, 

STOPPING CRITERION FOR ICR IS ACHIEVED AT ITERATION n = 11. 


Method 

SpaRSA 

MM 

Elastic Net 

ICR 

Avg f(x*) 

1.8244E-2 

1.4721E-2 

3.1938E-2 

1.3379E-2 

MSE vs. x g 

9.2668E-4 

2.2290E-3 

2.9210E-4 

5.785IE-5 

SM vs. x g (%) 

84.46 

83.12 

67.46 

97.25 


Table II 

Comparison of methods for p = 512 and q = 128. On Average, 
STOPPING CRITERION FOR ICR IS ACHIEVED AT ITERATION n = 15. 


Method 

SpaRSA 

MM 

Elastic Net 

ICR 

Avg f(x*) 

4.7510E-2 

3.9953E-2 

4.7885E-2 

3.9127E-2 

MSE vs. Xo 

1.1740E-3 

2.5685E-3 

9.3378E-4 

3.9277E-4 

Sparsity Level 

55.91 

64.18 

85.30 

28.82 

SM vs. x 0 (%) 

89.68 

82.41 

87.61 

96.33 


global solution to non-convex and mixed-integer programming 
problems. We used this feature of CPLEX to compare ICR’s 
solution with the global minimizer. For obtaining the results 
in Table |I] we choose p = 64, q = 32 and the sparsity level of 
jco is 10. We generated 1000 realizations of A.xq and n and 
recovered x using different methods. Two different methods are 
used for evaluation of different sparse recovery methods: First, 
we compare different methods in terms of cost function value 
averaged over realizations, which is a direct measure of the 
quality of the solution to (|8j. Second, we compare performance 
of different methods from the sparse recovery point of view, 
and used the following figures of merit: mean square error 
(MSE) with respect to the global solution (jc^) obtained by 
CPFEX optimizer, “Support Match” (SM) measure indicating 
how much the support of each solution matches to that of x ? . 

As can be seen from Table |I] ICR outperforms the compet¬ 
ing methods in many different aspects. In particular from the 
first row, we infer that ICR is a better solution to ([8]i since it 
achieves a better minimum in average sense. Moreover, sig¬ 
nificantly higher support match (SM = 97.25% ) measure for 
ICR shows that ICR’s solution shows much more agreement 
with the global solution. Finally, the ICR solution is also the 
closest to the global solution obtained from CPFEX optimizer 
in the sense of MSE (by more than one order of magnitude 
in comparison with competing solutions). 

Next, we present results for a typical larger scale problem. 
We chose p = 512, q= 128 and set the sparsity level of xo to 
be 30 and carry out the same experiment as before. Because of 
the scale of the problem, the global solution is now unavailable 
and therefore, we compare the results against xo which is the 
“ground truth”. Results are reported in Table [II] Table [II] also 
additionally reports the average sparsity level of the solution 
and it can be seen that the sparsity level of ICR is the closest 
to the true sparsity level ofxo. In all other figures of merit, viz. 
the cost function value (averaged over realizations), MSE and 
support match vs. xo, ICR is again the best. Fig. [I] shows an 
alternate result as the MSE plotted against the sparsity level; 
once again the merits of ICR are readily apparent. 

Image reconstruction: In this part we aim to apply our ICR 
algorithm to real data for reconstruction of handwritten digit 
images from the well-known MNIST dataset |45’j. The MNIST 



Figure 1. Comparison of average MSE each method versus sparsity level of 
Jo- 



Figure 2. Examples of reconstructed images from MNIST dataset using 
different methods. The Numbers appeared next to each method is the average 
MSE for that method. On average, stoping criteria for ICR and ICR-NN are 
achieved at iteration n — 15 and n — 29, respectively. 


dataset contains 60000 digit images (0 to 9) of size 28 x 28 
pixels. Most of pixels in these images are inactive and zero 
and only a few take non-zero values. Thus, these images are 
naturally sparse and fit into the spike and slab model. We set 
up this experiments such that a sparse signal x (vectorized 
image) is to be reconstructed from a smaller set of random 
measurements y. For any particular image, we assume the 
smaller set of random measurement (150 measurements) is 
obtained by a Gaussian measurement matrix A £ R 150x784 
with added noise according to 0. We compare our result 
against the following state-of-the-art image recovery methods 
for sparse images: 1.) SALSA-TV which uses the variable 
splitting proposed by Figueiredo et al. J46] | combined with 
Total Variation (TV) regularizes ED- 2.) A Bayesian Image 
Reconstruction (BIR) [25|, based on a more recent version 
of Bayesian image reconstruction method |26) proposed by 
Hero et al.. We also compare our results with Adaptive Elastic 
Net method |39]| which is commonly used in sparse image 
recovery problems. Finally, we also show the result of the non¬ 
negative version of ICR (ICR-NN) which explicitly enforces a 
non-negativity constraint on x which in this case corresponds 
to the intensity of reconstructed image pixels. Recovered 
images are shown in Fig. [2] and the corresponding average 
reconstruction error (MSE) for the whole database for different 
methods appears next to each method. Clearly, ICR and ICR- 
NN outperform the other methods both visually and based on 
MSE value. It is also intuitively satisfying that ICR-NN which 
captures the non-negativity constraint natural to this problem, 
provides the best result overall. 
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V. Conclusion 

We develop a novel algorithm (ICR) for sparse recov¬ 
ery under spike and slab priors. Unlike known existing ap¬ 
proaches, ICR does not simplify the optimization by assump¬ 
tions/relaxations and hence affords a more general sparse 
structure. Experiments on synthetic data as well as a real- 
world image recovery problem confirms practical merits of 
ICR. Future research may investigate further analysis of ICR 
properties and extensions to multi-task sparse recovery under 
collaborative spike and slab priors. 

Appendix 

Experimental Validation 

In this Appendix, we show more experimental results from 
our framework to further support its significance in comparison 
with other state-of-the-art methods for spike and slab sparse 
recovery problem. Following the same experimental setup for 
synthetic data as in the letter, we illustrate the performance 
of the ICR in comparison with others as the sparsity level 
of *o (||jco|| o) changes. We vary the true sparsity level from 
only 5 non-zero elements in xq up to 95 and compared MSE, 
support match percentage and the resulting sparsity level of 
the solutions from each method. Again we choose the length 
of sparse signal to be p = 512 and number of observation to be 
q = 128. Matrix A and sparse vector xo are randomly generated 
and observation vector y is obtained by 0 with o = 0.01. 
1000 realization of A, xo and n are generated for each sparsity 
level and the results are averaged over these 1000 realizations. 
Figures [T] [3] and [4] illustrate these results. 

Fig. [3] illustrates that the support of ICR’s solution is the 
closest to the support of Jto- More than 90% match between 
the support of ICR’s solution and that of xo for a wide range of 
sparsity levels makes ICR very valuable to variable selection 
problems specially in Bayesian framework. Fig. [4] shows the 
actual sparsity level of solution for different methods. The 
dashed line corresponds to the true level of sparsity and ICR’s 
solutions is the closest to the dashed line implying that the 
level of sparsity of ICR’s solution matches the level of sparsity 
of xq more than other methods. This also support the results 
obtained from Fig. [3] 



True Sparsity level ||x 0 || 0 


Figure 3. Comparison of average support match of the solution in % for each 
method versus sparsity level of xq. 



Figure 4. Comparison of average sparsity level obtained by each method 
versus sparsity level of xq. Dashed line shows the true level of sparsity 


Appendix 

Analytical Results 

In this appendix, we present the proofs to the theoretical 
lemmas and theorems in the paper. For the rest of our analysis, 
without loss of generality we assume that |y,| < 1, i = l...q, 
|v, < 1, i = 1 ...p and columns of A have unity norm. We first 
begin with the proof to Proposition 1. 

Proposition 1. The MAP estimation in 0 is equivalent to the 
following minimization problem: 

(x*,f) = argmin | \y -Ax\\j + A,| |x| \j + £ P«Yi (16) 

** ti 

? ~ A 2i /2 tco 2 (1-k,-) 2 \ 

where p, = cr log ^ J. 

Proof. To perform the MAP estimation, note that the posterior 
probability is given by: 

/(*>Y> |A,y,A,,K) oc f(y\A,x,y,o 2 )f(x\y,o 2 ,'k)f(y\K). (17) 

The optimal jc*,y* are obtained by MAP estimation as: 

(x*,Y) = argmin{—21og/(*,y, |A,y,A,,ie)}. (18) 

xq 

We now separately evaluate each term on the right hand side 
of ( fl7) >. According to ([4]) we have: 


=> -2\ogf{y\A.x,y.a 2 ) = qdogo 2 +<?log(27t) +-^||y-Aj:|| 2 . 

0 " 

Since y, is assumed to be the indicator variable and only takes 
values 1 and 0, we can rewrite (|5]» in the following form: 

*|y,A,,0 2 ~ n ^ 
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Therefore 


/(x|y,cj 2 ,A) = 
p / 1 \ y< 

n( ^—9 / 1 M /9 ) eX P 


i V(2 tig 2 /A ) 1 / 2 


y< x i 


i 


( 2na 


2\ -lltiTl 


V *■ 


'2o 2 X-V 0 
1 p .Ip 


where e ; - is the / " basis function with one at component j 
and zeros elsewhere. x,- is the j th element of x ! " ,l+11 and x /, is 
equal to x 2 ' 0 111 except at j th element which is zero. We prove 
that if Iq 1 


(«o)| 


< ap 7 , then xj = 0 . 
.T t\T , 


exp \ - r.-? | ns, 


1—Y. 

'0 


fn 0 +i(xb) = xl(A t A + U)x b - 2y T Ax b + £ 


t 2\-Ae?=i Y; 


1*1 O \ T~~T „ 1 —Y; 


-21og/(x|Y,a 2 ,A) = 


(=t 

/„ 0 + i(x (n ° +1) ) = (x*+x;e y O r (^ 7 ’A +A/) (**+*,■«,■) 

Pi 


(«o) 


■\xb t \ 


a 2 A 


1*1 I 2 I l ( 2ttO“ \ r-i 7l ... ~ 

,i„„i 1 L^ _ 2 L ( 1 -Y;) lo g 5 o- 


T+log 


-2y l A{x b + x je j) 

\rj 


i=l \P, 


(«o) 


■\x bi \ 


1=1 i= 1 

In fact 80 = I(jc/ = 0) and the final term on the right hand side 
evaluates to zero, since I(x,- = 0 ) = 1 => logI(xi = 0 ) = 0 , and 
I(*i = 0) = 0 =t> Xi 7 ^ 0 =>■ Y; = 1 =t> (1 - Yi) = 0. 

Finally ([ 6 ]) implies that 

/(y|k) = iku-*) 1 ^ 


Therefore, their difference is: 

fn 0 +i{x (no+l) )-fn 0 +i{x b ) = x)e](A T A + Xl)e 

P j 1 


+2xjx h (A A + )J)ej — 2xjy‘ Aej 


= X j 


i 4 i 

Pv 


i=l 


(|x 7 |(A r A + A/); 

\**j I 

cy (y^Aey - xl(A t A + U)e 


(23) 


-21og/(Y|K) = -2 log kJ' +log(l- k ,-) 1 Yi 


i=l 
P 

-2 52 Yi log K, + 0 - Yi) log (1 Kj) 

i=l 
p 


We want to show that this difference is always positive 
except for xj = 0 which means Xj must be zero in order 
for /„ 0 +i(x ( ” 0+1) ) to be minimum. To do so, we show the 
following statements are true for nonzero xf. 


2^Yilog(y^) +log(l-K,) 2 Xj (y T Aej -xl(A T A + Xl)ej ) < \xj\ (|*;|(A r A + A /) ;7 + -^yr) 


p /i_ v .\- p 

lYi log —^ — 2 52 l°g(! — K /)- 

i=t V K ' / i=t 


4*2 


Plugging all these expressions back into ( p~ 8 j ) and neglecting 
constant terms, we obtain: 


y T Aej —x h AAe\ 


-hclej 


< 


(A^A + AJ) 


Pi 


(x*,Y*) = arg min q log a 


1 


x.y 


I \y-Ax\ 


a 2 A^ 


/2t ic 2 \p P /1-k A 2 

+ >»g(—JET. + ET.l°g(—)(19) 

Essentially, for fixed a 2 The cost function will reduce to: 

L(x, Y) = | |y —Ax\ | 2 + A, | |x| I 2 + ^ PiYi (20) 

i= 1 

where p ; - = O 2 log ^ 2710 ) • P 

Lemma 1. If |a'2" 0 ^| < a Pj> tlten x^ n ° +1 ' > = 0. (yj « = 0) 

Pj 

Proof. Assume that for a specific j, | < Otp 7 . Then for the 
next iteration the cost function to be minimized is as follows: 


<S=> 


<t=> 


y Ae i 


-xlA T Aej 


< (1 + A.) | Xj | + 


jj “r 

Pv 


M | 


(y -Ax b ) T Aej < (1 + A) \xj \ + ■ 


“? 0) I 

P> 

„(»o)I 


(24) 


In the above derivations, we used the fact that (A r A)jj = 1 
since columns of A have unity norm. On the other hand, 
Cauchy-Schwarz inequality implies that, 

(y — Ax b ) T Ae j < 2||y-A*fc[|.||Ae;|| = 2||y-Axfe|| 

< 2 ( 1^11 + 11 ^ 11 ) < 2 (yfq + p) 

Last inequality holds because of the fact that we assumed that 
magnitude of x,- and y, do not exceed one. Also since we 
assumed \p''j' n ' l \ < ctp/ and by definition of a we have: 


(1 +A)|.i 


/«o+i (x) = x T (A r A + A I)x- 2y T Ax + 52 


Pi 


'=1 Wi 


(»o) I 


(21) 


n^yy > ^ > 2 (g + p)> 2 (y/q + p) 

\Pj 


Assume that the argument that minimizes ( [ 2 T| is x 1 
can rewrite it in the following form: 

x ( " 0+1 ) =x b +xjej 


( ,! o+f 


. we 


( 22 ) 


Therefore, ( |24| is always true, since the right hand side is 
always greater than the left hand side. This implies that ( |23j ) 
is positive for nonzero x ; and, hence we must have Xj = 
0. Otherwise, it would contradict the fact that /(x^" 0+1 )) is 
the minimum value. Note that these are loose bounds and in 
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practice they are easily satisfied. For example, \y Ax/, | is 
practically very small. □ 

Lemma 2. If // ( | > rap , for all n > no, then there exists 
Nj > no such that for all n > Nj we have 


1 


1 


A t 


(«+U I 




< 


n+ 1 


(25) 


where c is some positive constant. 


Proof Assume \p " | > ap,- = e. First, note that it is straight¬ 
forward to see that the difference of consecutive average values 


has the following property: —+ < Ly 


(n+l) I 


77+I 

Now, let Nj = then for all n > Nj we have: 

| (n) | 1 ^ | (n+1) | ^ | («) | | 1 

\U : -< \u) < «;■ H- 

' 1 ' « +1 ' 1 ' ' 3 ' n +1 

where the left hand side is positive, since 


LA"' < _L_ 

\Pj I — n+l' 


(26) 


.,(«) 


1 1 ap, 

TT >a p 7-^ = ^ = 5 >0 


Using this fact and 
1 


I + I + n+l 

1 


< 


we infer that: 

1 1 


I (n+l) I 

1 + I 


,(« 


i 

77 +1 


1 

71+1 


1 

I ( n 


< 


(Inf 


— llu ( " 

n+l J \Pj 


< 


~ 1 I (n) 

1 + 

1_1_ 

77+1 



1 

1 

< 


1 (”+ 1 )i 
1+ \ 

1 (”) 1 
1+ 1 




1 

1 


1+1 

_1_ 

77+1 

l M l 
1+ 1 

i 

1 

< 


1 («+l)| 
1+ \ 

1 M1 
1+ 1 




1 

77+1 



(1 + 

1 n+l J 

1+ 1 


In the last expression, we have: 
RHS= ’ 


(«+ 1 )(K' ) |-stt)| 

-1 


LI IS = 


Therefore, 


n+l ) I P] I 


< 


> 


1 


(n + l)e5 

-1 


(»+D(K' , l + A)|nf| (” + 1 ) £S 


< 7- n > Nj. 

(n + l)e8 1 


1 

1 

1 (”+1)1 
1+ 1 

1 ( n ) 1 
1+ 1 


always less than ap, or always greater. Because according 
to Lemma 1, we know that if |+■ | once becomes smaller 
than ap, for some n, it will remain less than ap, for all 
the following iterations. Therefore, let tij, j = 1 ...p be the 
iteration index that for all n > nj, |p - | < e. Note that some 
tij’s may be equal to infinity which means they are never 
smaller than £. For those j that /; ,■ = °°, let Nj to be the same 
as Nj defined in proof of Lemma 2. With these definitions, 
we now proceed to prove the Theorem. We first show that for 
n > No = max(maxjnj,maxjNj), the sequence of /„(+')) has 
the following property: 

|/n+l (X ( " } ) - fn(x [n) ) | = I Pi (ti - ) l J 

1=1 | Pi | Pi | 


< 


E 




i Pi 


77 —1) I 


E pt 


i (n— l)i^ 

K l>e 


I Pi 


I Pi 


77—1) I 


< 


l^"“ 1) l>e 


c c c 

Pi—|x,-| < pmax{p,}- < -. 

n n n 


(28) 


This property also holds for jL" +l f Finally, We show that for 
n > No, a n = /„++ is Quasi-Cauchy. Since the minimum 
value /„ + i(jtf" +1 )) is smaller than /„ + i(xW), we can write: 


/ n+1 (x( n+1) )-/«(x (n) ) < /„+i(* W ) -/„(*«) < 


where we used 
n > No we have: 


for n > No- With the same reasoning for 


/»+l(x ( " +1) )-/»(^) > /„+l(^' !+1) ) -fn{x (n+l) ) > -- 

n 

Therefore, 

\f n+l (x^)-f n (x^)\< 
for n > No- □ 


c 

n 


Remark: Despite the fact that analytical results show a 
decay of order { in difference between consecutive optimal 
cost function values, ICR shows much faster convergence in 
practice. 


□ 


Theorem 1. After a sufficiently large n, the sequence of 
optimal cost function values obtained from ICR forms a 
Quasi-Cauchy sequence, i.e. a n = /„++ is a Quasi-Cauchy 
sequence of numbers. 

|/„+i(^” +1) )-/„(x W )|<- (27) 

i i n 

Proof. Before proving the theorem, note that we can assume 
for a sufficiently large No, if n > No, then ^p 1 " 1 is either 
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